1. Introduction

1.1 Objectives

The objective of the present work is to develop a smart keyboard that helps people be more effective on their mobile devices. A predictive text model has been developed that offers the user of a mobile device three options for the next word.

1.2 Context

To develop such a model, a large corpus of text documents was created by merging three different types of English sources: blogs, news and tweets. The raw data, the Capstone Data Set, was provided by Johns Hopkins University, and all the code used to create this report and the proposed model is available on GitHub.

1.3 Summary

1.3.1 Research question

The research question is: “How can an efficient predictive text model be developed on the basis of publicly available data such as blogs, news wires and tweets?” This implies that the methodology developed in this work can be replicated in any language, if needed.

1.3.2 Conclusion

1.3.3 Outline of the report

2. Sections

2.1 Data

2.1.1 Raw data

The data comprise more than four million documents, the exact total being 4’269’678. The following table gives summary statistics for the three source files. The results show that some blog documents are very long compared with the median length of every document type.
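As an illustration of how such per-source statistics can be obtained, the following minimal Python sketch computes word-count summaries for a list of documents. It is not the code used for this report: the function name `word_count_summary`, the whitespace-based word count and the toy documents are assumptions made for the example.

```python
from statistics import mean, median


def word_count_summary(documents: list[str]) -> dict[str, float]:
    """Summarise the number of words per document for one source.

    Words are approximated by splitting on whitespace, which is an
    assumption of this sketch rather than the report's definition.
    """
    counts = [len(doc.split()) for doc in documents]
    return {
        "documents": len(counts),
        "min_words": min(counts),
        "median_words": median(counts),
        "mean_words": mean(counts),
        "max_words": max(counts),
    }


# Toy documents standing in for the three real source files.
toy_sources = {
    "blogs": ["a long blog post with quite a few words in it", "short post"],
    "news": ["a news item of moderate length", "another news item"],
    "twitter": ["a tweet", "ok"],
}

for name, docs in toy_sources.items():
    print(name, word_count_summary(docs))
```

A mean well above the median for a given source is one simple sign that a few documents are much longer than the rest.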

2.1.2 Data characteristics

A preliminary investigation was conducted to understand the properties and patterns of the data and to suggest modelling strategies. The following histograms show the distribution of word counts for each file source.

The distribution of blog documents is positively skewed (right-skewed), reflecting the fact that a few blogs contain a large number of words. Its shape also resembles that of a Poisson distribution.

The distribution of news documents appears to be slightly bimodal. Both distributions contrast sharply with that of tweets, which are much shorter in terms of the number of words per document.

The following violin plots show the full distribution of word counts for each source, with the probability density shown at different values. The median, interquartile ranges and other statistics can be consulted by hovering over the plots.

2.1.3 Data pre-processing

The data pre-processing sequence is the following:

  • Unwanted character removal: All non-essential characters (e.g. “>”, “<”, “=”, “~”, “#”) were first removed with regular expressions to facilitate analysis and model development. Numbers separated by “-” were also deleted.
  • Corpus creation: All the documents from the three different sources were merged into a single corpus.
  • Tokenization: All texts contained in the corpus were then separated into smaller units called tokens, which can be words, characters or subwords.
  • Lower-casing: All tokens were converted to lower case.
  • Token removal
    • Punctuation, symbols such as emoji, URLs, separators and isolated numbers were removed.
    • All profanity words were deleted. The profanity filter was built from this data source.
    • All stopwords were removed.
    • A personal dictionary was built to hold other insignificant words that needed to be deleted.

After this pre-processing sequence, the data contained in the corpus were deemed ready for further analysis and model development.
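The pre-processing sequence described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for this report: the regular expressions, the function names `build_corpus` and `preprocess`, and the small word lists standing in for the profanity list, the stopword list and the personal dictionary are all hypothetical.

```python
import re

# Hypothetical placeholder lists. The profanity source, stopword list and
# personal dictionary actually used for the report are not reproduced here.
PROFANITY = {"darn"}
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}
PERSONAL_DICTIONARY = {"rt"}
DROP = PROFANITY | STOPWORDS | PERSONAL_DICTIONARY

# Assumed patterns for the steps described in the list above.
UNWANTED_CHARS = re.compile(r"[><=~#]")
HYPHENATED_NUMBERS = re.compile(r"\b\d+(?:-\d+)+\b")
URL = re.compile(r"(?:https?://|www\.)")
WORD = re.compile(r"[a-z]+(?:'[a-z]+)?")


def build_corpus(*sources: list[str]) -> list[str]:
    """Merge the documents of all sources into a single corpus."""
    return [doc for source in sources for doc in source]


def preprocess(document: str) -> list[str]:
    """Clean one document and return its remaining lower-case tokens."""
    # 1. Remove numbers separated by "-" and other unwanted characters.
    text = HYPHENATED_NUMBERS.sub(" ", document)
    text = UNWANTED_CHARS.sub(" ", text)
    # 2. Tokenize on whitespace and convert to lower case.
    tokens = [token.lower() for token in text.split()]
    # 3. Drop URL tokens, then keep only alphabetic words, which discards
    #    punctuation, emoji, other symbols and isolated numbers.
    words = []
    for token in tokens:
        if URL.match(token):
            continue
        words.extend(WORD.findall(token))
    # 4. Remove profanity, stopwords and personal-dictionary entries.
    return [word for word in words if word not in DROP]


corpus = build_corpus(
    ["The cat is #happy"],                # blogs (toy example)
    ["Call 555-1234 NOW!"],               # news (toy example)
    ["RT see https://example.com 2020"],  # tweets (toy example)
)
cleaned = [preprocess(doc) for doc in corpus]
print(cleaned)
```

In this sketch all removals are applied per document in a single pass; with a corpus of several million documents, streaming the files line by line rather than loading them into memory would probably be preferable.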

2.2 Methods

2.3 Analysis

2.3.1 Exploratory analysis

  • Corpus size
  • Balance, representativeness and sampling

2.4 Results

3. Conclusion

3.1 Research question addressed

3.2 Results obtained

3.3 Recommendations

4. Appendix

4.1 Details of data and process

4.2 References

data source